ABSTRACT:

Nowadays, on the Internet and the Web, great amounts of information in various forms and on
different subjects are available to users. The available information can be divided into three categories:
structured, unstructured and semi-structured. Information retrieval systems traditionally retrieve
information from unstructured text, that is, text without markup. XML retrieval is content-based
retrieval of documents structured with XML. The aim of XML retrieval is to return the relevant parts of an
XML document that, by exploiting the document structure, can answer users' needs. In this research
we examine XML retrieval. In particular, its models, challenges and retrieval methods are
studied.
I. INTRODUCTION:

With the rapid spread of the Extensible Markup Language (XML) on the Internet, retrieval of
XML data has become one of the most interesting research topics. Since collections of XML documents keep
growing, search and retrieval engines are needed that operate over sets of XML documents.
XML documents contain not only textual information but also information about the logical
structure of the documents. This logical structure is a tree-like structure encoded by the XML
tags. In XML retrieval, elements and components of a document are retrieved, not the whole document.
Content-based retrieval of XML documents has received a great deal of attention over the past few years,
driven mainly by the INEX initiative [1]. The aim of XML retrieval is to return the relevant parts of
an XML document that, by exploiting the document structure, can answer users' needs [2]. Information
retrieval systems are often contrasted with relational databases. In XML retrieval, users' information needs
are expressed as queries that include key phrases and structural constraints. The structure specifies the paths
of the XML elements in the collection from which the system should return information [3]. In XML documents
and texts, structure and content are separable [4]. In response to a query, an information retrieval system
returns a ranked list of documents. The user then examines them linearly, starting with the highest ranked [5].
Since the number of XML components is generally high, users need XML retrieval systems that help them
decide which retrieved components to review. One approach is summarization, which is useful in
interactive information retrieval. In interactive XML retrieval, a summary can
be associated with each part of a document returned by the XML retrieval system [6].

II. THE STRUCTURE OF TEXTUAL INFORMATION 

Based on its structure, textual information can be divided into three categories:

2.1. Unstructured data: unstructured data is raw text that is not delimited by markup or syntactic
labels.

2.2. Structured data: structured data is data whose format is defined in advance. With structured data the user
can obtain an exact and specific answer to their needs.

2.3. Semi-structured data: semi-structured data lies between structured and unstructured data and has
a stronger structure than unstructured data. Structural information needs to be incorporated into semi-structured data.
III. INEX

INEX (the Initiative for the Evaluation of XML Retrieval) is an international initiative for the study of XML retrieval.
Existing XML retrieval approaches differ in how they use structure when ranking and scoring elements and in which
structures they return, which also affects memory and time requirements. One approach returns only logical elements,
such as sections and paragraphs, in the search results. Another approach allows users to specify their structural
preferences in the form of structural constraints [7]. INEX queries build on XPath to select the XML structural paths
that the user specifies in the query, and extend it with the about() function, which is used to filter
components [8].

In INEX, keywords are combined with structural constraints. Thus, in response to a query, a ranked list of XML
components is returned, each of which must meet the following conditions:

1. It contains at least one of the keywords.

2. It satisfies the given structural constraints [4].
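
The two conditions above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the structural constraint is reduced to a single tag name, the ranking is a raw count of keyword occurrences, and the sample document and the function name `matching_components` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from any INEX collection.
DOC = """
<article>
  <title>XML retrieval</title>
  <sec><title>Indexing</title><p>Element indexing stores text per element.</p></sec>
  <sec><title>Ranking</title><p>Scores are propagated to parent elements.</p></sec>
</article>
""".strip()

def matching_components(root, tag, keywords):
    """Return (element, hits) pairs for elements with the given tag
    (the structural constraint) that contain at least one keyword."""
    results = []
    for elem in root.iter(tag):
        text = " ".join(elem.itertext()).lower()
        hits = sum(text.count(k.lower()) for k in keywords)
        if hits > 0:
            results.append((elem, hits))
    # Rank by number of keyword occurrences, highest first.
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

root = ET.fromstring(DOC)
ranked = matching_components(root, "sec", ["indexing", "element"])
titles = [elem.findtext("title") for elem, _ in ranked]
print(titles)
```

A real system would replace the occurrence count with one of the ranking models discussed later in the paper.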

3.1. Track goals at INEX

In INEX, each track studies a particular task. The tracks are as follows [9]:

- Ad hoc Track
- Language Processing Track
- Interactive Track
- Multimedia Track
- Use Case Track
- Entity Ranking
- Book Search
- Link the Wiki
- Question Answering

IV. CO AND CAS 

Users' information needs at INEX are expressed in two ways, content-only (CO) and content-and-structure (CAS).
A CO query consists of key phrases only, in the way typically used for retrieving information on the Internet.
A CAS query combines structural and textual constraints. In recent years much work has been done on
CAS, which in 2005 was run as four sub-tasks:

4.1. VVCAS: both the target element constraint and the support element constraints are interpreted vaguely.

4.2. SVCAS: the target element constraint is followed strictly, but the support element constraints are
interpreted vaguely.

4.3. VSCAS: the target element constraint is interpreted vaguely, but the support element constraints
are followed strictly.

4.4. SSCAS: both the target element constraint and the support element constraints are followed strictly.

When an information need contains structural constraints, each of the two kinds of constraint (on the target
element and on the support elements) can be interpreted either vaguely or strictly [10]. CAS queries can be
processed by parsing the INEX expressions and deciding which indexes to use in the search. The basic ways of
processing CAS queries include the vector space model, DBMS-based approaches and the representation of XML
documents as trees [11]. CO queries suit ordinary users with limited programming skills: to reach the desired
information, users do not need to learn complex query languages such as XQuery and XPath
beforehand [12].

4.5. CAS queries fall into three types:

4.5.1. Path-based queries: path-based queries are built on XPath syntax, as in NEXI.

4.5.2. Clause-based queries: clause-based queries are usually derived from the XQuery language.

4.5.3. Fragment-based queries: fragment-based queries use XML fragments to retrieve XML documents [13, 14].

V. COMRANK SYSTEM IN XML RETRIEVAL:

ComRank is an intermediary search (metasearch) system used for automatic ranking in XML retrieval systems [15].
ComRank takes a free approach to intermediary search: its results are obtained from several
systems, the systems it relies on most are those with a high ranking, and its results therefore come from systems
that perform better than the others [16]. ComRank resembles a voting system based on consensus [17].
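
The idea of consensus voting over several result lists can be illustrated with a Borda count. This is a swapped-in textbook technique chosen for illustration, not ComRank's actual algorithm, which is not specified here; the function name `borda_merge` and the sample runs are hypothetical.

```python
from collections import defaultdict

def borda_merge(rankings):
    """Merge several ranked lists with a Borda count: an item at position i
    (0-based) in a list of length n receives n - i points from that list."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Sort by total points (descending), then by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical result lists of three XML retrieval systems.
system_runs = [
    ["e1", "e2", "e3"],
    ["e2", "e1", "e4"],
    ["e2", "e3", "e1"],
]
merged = borda_merge(system_runs)
print(merged)
```

Elements that many systems place near the top collect the most points, which is the sense in which the merged list reflects a consensus.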

VI. TREX: 

TREX is an XML retrieval system that can use several structural summaries, including newly defined ones. TREX
manages by itself a large number of small indexes, and thus speeds up the evaluation of a workload of top-k queries. TREX
has three retrieval methods: exhaustive retrieval, TA and merge. In TREX, the structural summary and the inverted lists
are stored in two tables, one for the elements and one for the posting lists. A NEXI query is evaluated in TREX
in two phases, translation and retrieval [18]. The TREX function in the TopX search engine is a generalization
of the markup function [19].
VII. RF IN XML RETRIEVAL

Relevance feedback (RF) is a technique that allows users to provide feedback on the initial search results. The
purpose of relevance feedback is to express the user's needs more precisely. RF approaches have been proposed for
XML retrieval. These approaches enrich queries by adding words extracted from whole texts and documents
[20]. In other words, relevance feedback is employed to improve the accuracy of the results by extracting
keywords from documents. In RF, to rank components with the AQR algorithm, a separate index is created for each
component [22].
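
The query enrichment step can be sketched as follows. This is a minimal illustration of term-based query expansion in general, not the AQR algorithm; the stopword list, the function name `expand_query` and the sample feedback texts are assumptions made for the example.

```python
from collections import Counter

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "of", "in", "is", "and", "to", "for", "not"}

def expand_query(query_terms, relevant_texts, n_new_terms=2):
    """Add the n most frequent new terms found in the components that the
    user marked as relevant to the original query."""
    counts = Counter()
    for text in relevant_texts:
        for token in text.lower().split():
            token = token.strip(".,;:()")
            if token and token not in STOPWORDS and token not in query_terms:
                counts[token] += 1
    # Order by frequency (descending), then alphabetically for determinism.
    candidates = sorted(counts, key=lambda term: (-counts[term], term))
    return list(query_terms) + candidates[:n_new_terms]

feedback = [
    "Ranking XML elements in structured retrieval.",
    "Element retrieval returns elements, not documents.",
    "Focused retrieval of XML.",
]
expanded = expand_query(["xml", "ranking"], feedback)
print(expanded)
```

The expanded query is then run again, so that components sharing vocabulary with the judged components move up the ranking.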

VIII. EVALUATION OF XML RETRIEVAL 

Evaluation of XML retrieval is based on two dimensions: component coverage and topical relevance. Component
coverage is defined at four levels:

8.1. Exact coverage (E): the information sought is the main topic of the component, and the component is a
meaningful unit of information.

8.2. Small coverage (S): the information sought is the main topic of the component, but the component is
not a meaningful unit of information.

8.3. Large coverage (L): the information sought is present in the component, but is not its main topic.

8.4. No coverage (N): the information sought is not a topic of the component.

The topical relevance dimension also has four levels:
- Highly relevant, denoted by the number 3.
- Fairly relevant, denoted by the number 2.
- Marginally relevant, denoted by the number 1.
- Not relevant, denoted by the number 0.

Components are judged on both dimensions, and the two judgments are then combined into a digit-letter
code. The combinations of relevance and coverage are quantized as follows:

\[
Q(rel, cov) =
\begin{cases}
1.00 & \text{if } (rel, cov) = 3E \\
0.75 & \text{if } (rel, cov) \in \{2E, 3L\} \\
0.50 & \text{if } (rel, cov) \in \{1E, 2L, 2S\} \\
0.25 & \text{if } (rel, cov) \in \{1S, 1L\} \\
0.00 & \text{if } (rel, cov) = 0N
\end{cases}
\]

Formula 1. The combinations of relevance and coverage [23].

2S is a fairly relevant component that is too small: a 2S component provides incomplete information, but still
answers the question to a limited extent. 3E is a highly relevant component that has exact coverage. A non-relevant
component cannot have exact coverage, so the combination 3N is impossible.

The quantization function Q does not impose a binary choice between relevant and non-relevant; it allows a
component to be graded as partially relevant. The number of relevant components in a retrieved set A is calculated
as follows [23]:

\[
\#(\text{relevant items retrieved}) = \sum_{c \in A} Q(rel(c), cov(c))
\]

Formula 2. Relevant components in the retrieved set [23].
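
The quantization and the sum over the retrieved set can be written directly as a lookup table. The sketch below uses only the codes listed in the quantization; rejecting any other code with an error is an assumption of the example, and the names `quantize` and `relevant_retrieved` are hypothetical.

```python
# Quantization table: digit = topical relevance, letter = component coverage.
QUANTIZATION = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}

def quantize(rel, cov):
    """Return Q(rel, cov) for a relevance level 0-3 and a coverage letter."""
    code = f"{rel}{cov}"
    if code not in QUANTIZATION:
        # Assumption: codes not listed in the table are treated as invalid.
        raise ValueError(f"combination {code} is not defined")
    return QUANTIZATION[code]

def relevant_retrieved(judgments):
    """Sum Q over the (rel, cov) judgments of a retrieved set A."""
    return sum(quantize(rel, cov) for rel, cov in judgments)

retrieved = [(3, "E"), (2, "S"), (1, "L"), (0, "N")]
total = relevant_retrieved(retrieved)
print(total)
```

Because Q is graded, four retrieved components here count as 1.75 relevant items rather than as a whole number.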



IX. CHALLENGES IN XML RETRIEVAL:

The challenges in XML retrieval are as follows:

- parts of the document that must be retrieved
- parts of the document that should be indexed
- nested elements
- statistical terms
- heterogeneity model

9.1. Parts of the document that must be retrieved

XML retrieval should return parts of documents, that is, XML elements, rather than whole documents.
The existing solution to this challenge is structured document retrieval, in which a system should
retrieve the most specific part of a document that answers the query.

9.2. Parts of the document that should be indexed

In unstructured retrieval the choice of indexing unit is usually straightforward, but in structured retrieval
there are four approaches: top-down (from the whole to the detail), bottom-up (from the detail to the whole),
indexing all elements, and non-overlapping pseudo-documents. The top-down approach is a two-step process that
begins with the largest elements as indexing units and then looks for the best sub-elements within each
retrieved element. In this method the relevance of a larger element is
not necessarily a good predictor of the relevance of the sub-elements contained in it. The bottom-up approach
considers all of the leaves, selects the most relevant ones and expands them into larger units. Indexing
all elements is the least restrictive approach. In the approach of non-overlapping pseudo-documents,
the pseudo-documents may be meaningless to the user since they are not coherent units.

9.3. Nested Elements 

To handle this challenge, all elements that are small or not relevant are set aside, and only the elements that
are useful as results are kept.

9.4. Statistical Terms 

The problem here lies in the distribution of terms: the same term can behave differently in different XML contexts,
so the document frequency of a term cannot always be estimated reliably. An available solution is to calculate idf
for pairs of XML context and term.
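
Computing idf per pair of XML context and term can be sketched as below. The example makes simplifying assumptions: the context is just the tag of the element that directly contains the term, idf is the plain log(N / df) variant, and the three sample documents are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict

def context_term_idf(xml_docs):
    """Compute idf for (context, term) pairs, where the context is the tag
    of the element that directly contains the term."""
    doc_ids = defaultdict(set)
    for doc_id, xml_text in enumerate(xml_docs):
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            for term in (elem.text or "").lower().split():
                doc_ids[(elem.tag, term)].add(doc_id)
    n_docs = len(xml_docs)
    return {pair: math.log(n_docs / len(ids)) for pair, ids in doc_ids.items()}

docs = [
    "<book><title>Gates</title><author>Smith</author></book>",
    "<book><title>Windows</title><author>Gates</author></book>",
    "<book><title>Gates</title><author>Jones</author></book>",
]
idf = context_term_idf(docs)
print(idf[("title", "gates")], idf[("author", "gates")])
```

The same word receives a different idf as a title term and as an author term. With many distinct contexts, each pair is seen in few documents, which is the estimation problem described above.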

9.5. Heterogeneity Model 

Two cases arise: the ideal case, and the case of similar elements in different schemas. In the ideal case only one
schema is needed, and this schema is understood by the user. In the other case, similar elements in different
schemas differ in two ways: they have different names, or they are organized differently in the structure.

X. INDEXING OF XML RETRIEVAL 

Several indexing strategies for XML retrieval have been developed, as follows:

- Element-based indexing: each element is indexed on the basis of its direct text and the text of its
descendants. This strategy has one major drawback: text that appears at the n-th level of the XML logical
structure is indexed n times, thus requiring more index space [24, 25].

- Leaf-only indexing: only leaf elements, or elements that are directly related to the leaves, are
indexed.

- Expanse-axis indexing: the concatenated text of an element is used to estimate term statistics [26].

- Selective indexing: small elements are removed and only selected element types are indexed.

- Distributed indexing: a separate index is created for each type of element. The ranking model runs on each
index separately and retrieves a ranked list of elements [22].
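
The difference between element-based and leaf-only indexing, and the redundancy of the former, can be seen in a small sketch. The path notation (child position among all siblings) and the function name `build_index` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(root, leaf_only=False):
    """Map each term to the paths of the elements indexed under it.
    Element-based: every element is indexed on its own and descendant text.
    Leaf-only: only elements without children are indexed."""
    index = defaultdict(list)

    def visit(elem, path):
        children = list(elem)
        if not (leaf_only and children):
            terms = set(" ".join(elem.itertext()).lower().split())
            for term in sorted(terms):
                index[term].append(path)
        # The bracketed number is the position among all children of elem.
        for i, child in enumerate(children, start=1):
            visit(child, f"{path}/{child.tag}[{i}]")

    visit(root, f"/{root.tag}")
    return index

root = ET.fromstring("<article><sec><p>xml retrieval</p></sec></article>")
element_index = build_index(root)
leaf_index = build_index(root, leaf_only=True)
print(element_index["xml"])
print(leaf_index["xml"])
```

The text sits at the third level of the tree, so element-based indexing records it three times, while leaf-only indexing records it once.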



XI. RANKING PATTERNS OF XML RETRIEVAL 

The ranking strategy is chosen according to the indexing strategy and to specific mechanisms, such as
propagation and aggregation, which are used when only leaf elements are indexed. Most ranking methods produce a
ranked list of elements for queries with limited or no structural constraints on the elements to be returned.

Score propagation ranks elements on the basis of the scores of the leaves [27].
The scores are then propagated upwards to the parents. With distributed indexes, the ranking model should be
applied to each index separately, each retrieving a ranked list of elements [28].
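
Upward propagation of leaf scores can be sketched as a recursive walk over the tree. The combination rule used here (a parent receives the sum of its children's scores multiplied by a decay factor) is an assumption chosen for illustration; the cited methods may combine scores differently.

```python
import xml.etree.ElementTree as ET

def propagate_scores(elem, leaf_scores, decay=0.5, scores=None):
    """Return {element: score}. A leaf keeps its own score; an inner element
    receives the decayed sum of its children's scores (assumed rule)."""
    if scores is None:
        scores = {}
    children = list(elem)
    if not children:
        scores[elem] = leaf_scores.get(elem, 0.0)
    else:
        total = 0.0
        for child in children:
            propagate_scores(child, leaf_scores, decay, scores)
            total += scores[child]
        scores[elem] = decay * total
    return scores

root = ET.fromstring("<article><sec><p/><p/></sec><sec><p/></sec></article>")
p1, p2 = root[0]          # the two paragraphs of the first section
p3 = root[1][0]           # the paragraph of the second section
scores = propagate_scores(root, {p1: 0.8, p2: 0.4, p3: 0.2})
ranked = sorted(scores, key=scores.get, reverse=True)
print([(e.tag, round(scores[e], 2)) for e in ranked])
```

With a decay below 1, a large element only outranks its best leaf when several of its leaves score well.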

XII. XML RETRIEVAL MODELS 

12.1. Language model 

This model combines estimates based on the full text of a component with estimates based on a compact
representation of the component. To improve effectiveness, it also uses the context in which a component appears,
such as the containing document and the body text, and the length of its path.

Sigurbjornsson, using a language model, evaluated different indexing strategies and built four indexes
for element retrieval:

- Element index: the traditional index of overlapping elements.

- Length-based index: only elements whose length exceeds a preset threshold are
indexed.

- Qrel-based index: the elements to index are selected by their tag names.
- Section index: only non-overlapping passages, based on the structure, are indexed [30].
12.2. The vector space model (VSM)

The vector space model is one of the most widely used and most efficient information retrieval models for
retrieving unstructured documents [31]. An example of its use for XML is the tree-based technique in which
the document collection is considered as a tree and the documents are subtrees of it. The query is also a tree,
and instead of returning a ranked list of documents, a ranked list of elements is returned [32].

A simple measure of the similarity between a path c_q in the query and a path c_d in a document is the CR
similarity function:

\[
CR(c_q, c_d) =
\begin{cases}
\dfrac{1 + |c_q|}{1 + |c_d|} & \text{if } c_q \text{ matches } c_d \\
0 & \text{if } c_q \text{ does not match } c_d
\end{cases}
\]

Formula 3. Vector space model [32].

Here |c_q| and |c_d| are the number of nodes in the query path and in the document path, respectively.
The final score for a document is computed as a variant of the cosine measure, called
SimNoMerge, and is defined as follows:

\[
\mathrm{SimNoMerge}(q, d) = \sum_{c_k \in B} \sum_{c_l \in B} CR(c_k, c_l) \sum_{t \in V}
\mathrm{weight}(q, t, c_k) \, \frac{\mathrm{weight}(d, t, c_l)}{\sqrt{\sum_{c \in B,\, t \in V} \mathrm{weight}^2(d, t, c)}}
\]

Formula 4. Final score for a document [32].

Here V is the set of non-structural terms, B is the set of all XML fields (contexts), and weight(q, t, c) and
weight(d, t, c) are the weights of term t in XML field c in query q and in document d, respectively.
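
The two formulas of this section can be sketched together in Python. Paths are represented as tuples of tag names. The matching test used here (the query path matches if it can be turned into the document path by inserting nodes) is an assumption of the example, as are the function names and the sample weights.

```python
import math

def cr(query_path, doc_path):
    """CR similarity: (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0.
    Assumption: cq matches cd when cq is a subsequence of cd."""
    remaining = iter(doc_path)
    matches = all(node in remaining for node in query_path)
    return (1 + len(query_path)) / (1 + len(doc_path)) if matches else 0.0

def sim_no_merge(query_weights, doc_weights):
    """query_weights and doc_weights map (path, term) -> weight.
    Only non-zero weights need to be stored; absent pairs count as 0."""
    numerator = 0.0
    for (cq, term_q), wq in query_weights.items():
        for (cd, term_d), wd in doc_weights.items():
            if term_q == term_d:
                numerator += cr(cq, cd) * wq * wd
    # Normalize by the length of the document vector.
    norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    return numerator / norm if norm else 0.0

query = {(("book", "title"), "gates"): 1.0}
document = {
    (("book", "title"), "gates"): 0.6,
    (("book", "chapter", "title"), "gates"): 0.8,
}
score = sim_no_merge(query, document)
print(round(score, 4))
```

The exact path match contributes with CR = 1, while the longer document path contributes with CR = 3/4, so occurrences in closer structural contexts count more.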

12.3. Models based on okapi 

Let the collection C contain n_E elements, e = 1, 2, ..., n_E. Let el be the length of an element and avel the
average element length. The weight of query term j for element e of document d in collection C is calculated by
the following formula, in which k_1 and b are tuning parameters:

\[
W_j(e, d, C) = \frac{(k_1 + 1) \, tf_{e,j}}{k_1 \left( (1 - b) + b \, \frac{el}{avel} \right) + tf_{e,j}}
\cdot \log \frac{N - df_j + 0.5}{df_j + 0.5}
\]

Formula 5. The Okapi formula [9].



Here tf_{e,j} is the frequency of query term j in element e, df_j is the document frequency of query term j,
and N is the number of documents [33].
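
The Okapi weight of a single query term in a single element can be computed as below. This is a sketch of the standard BM25 form using the quantities defined above; the default values k1 = 1.2 and b = 0.75 are common choices assumed for the example, not values given in the paper.

```python
import math

def okapi_weight(tf, df, n_docs, el, avel, k1=1.2, b=0.75):
    """Okapi (BM25) weight of one query term for one element.
    tf: term frequency in the element, df: document frequency of the term,
    n_docs: number of documents, el: element length, avel: average length."""
    length_norm = k1 * ((1 - b) + b * el / avel)
    tf_part = (k1 + 1) * tf / (length_norm + tf)
    idf_part = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part

w_avg = okapi_weight(tf=1, df=2, n_docs=10, el=100, avel=100)
w_long = okapi_weight(tf=1, df=2, n_docs=10, el=400, avel=100)
print(round(w_avg, 4), round(w_long, 4))
```

For an element of average length with tf = 1 the term-frequency part equals 1, so the weight reduces to the logarithmic part; the same occurrence in a longer element receives a lower weight.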

Okapi calculates the retrieval score of an element x for a query q using the following formula:
Formula 6. Okapi for calculating the retrieval score of an element [9].

Where:

\[
W_j = \log \frac{M - ef_j + 0.5}{ef_j + 0.5}
\]

Here q is the length and ef_j is the element frequency of term j [34].

12.4. Logistic regression model 

The probability of relevance of any document or document component is estimated from a set of statistics computed
over a document collection and a set of queries, each statistic being weighted by a coefficient fitted to it.

The probability P(R | Q, C) is obtained from the log-odds of relevance log O(R | Q, C), where for any two events A
and B the odds O(A | B) is a simple transformation of the probability, P(A | B) / P(Ā | B). The model is as follows:

\[
\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{S} b_i s_i
\]

\[
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}
\]

Formula 7. Logistic regression model [9].

Here b_0 is the intercept term, the b_i are the coefficients of the statistics, and the s_i are the S statistics.
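
The step from statistics to a probability of relevance can be sketched as follows. The coefficient and statistic values are made up for the example; in practice the coefficients are fitted on training data. The function uses the form 1 / (1 + e^(-x)), which is algebraically the same as e^x / (1 + e^x).

```python
import math

def relevance_probability(statistics, coefficients, intercept):
    """Turn the linear log-odds of relevance into a probability."""
    log_odds = intercept + sum(b * s for b, s in zip(coefficients, statistics))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: two statistics of a (query, component) pair.
p_low = relevance_probability([0.1, 0.2], coefficients=[1.5, 0.8], intercept=-2.0)
p_high = relevance_probability([1.0, 2.0], coefficients=[1.5, 0.8], intercept=-2.0)
print(round(p_low, 3), round(p_high, 3))
```

Components are then ranked by this probability, which always lies strictly between 0 and 1.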



XIII. TREE MATCHING IN XML RETRIEVAL



Retrieval systems must deal with many problems related to tree matching
and structured search. Documents may be very large, and when the search is not selective the response
may consist of many results.

An XML document collection may include documents that do not exactly fit the structure of the
search. Therefore, one of the key issues is how to choose the components that approximately satisfy
the constraints of the search [35]. Tree matching algorithms related to XML retrieval fall into two
main groups. The first covers exact algorithms, which find all occurrences of a pattern in an XML database. The
second covers approximate algorithms [36]. Existing algorithms for XML branching patterns
can be divided into two-step algorithms, which follow the analysis approach, and one-step algorithms, which
follow the navigation approach. Tree matching algorithms and approaches can be evaluated in two ways,
for efficiency and for effectiveness. Exact tree matching is directly related to
efficiency, while approximate tree matching is more closely associated with effectiveness [37].



At first, information retrieval was a concern of professionals in medicine, law and library science. Its
users were few specialists searching within their own domain, working with collections of
static documents and with linguistic tools. With the arrival of the information era and the growing use of the
Internet, the number of users rose abruptly, which in turn raised the importance of the discipline.

The World Wide Web and the Internet have brought a huge flood of data into all aspects of life.
As a result, there is an unprecedented demand for efficient techniques to handle these enormous amounts of data.
Nowadays querying the data and extracting relevant documents is not enough. Users want to focus on
specific information, down to the smallest details, and to leave out what is irrelevant. XML, which is regarded as
semi-structured data, is a potential candidate to meet these requirements.

This study has summarized recent efforts in the field of XML retrieval. The database community
offers methods that apply traditional database techniques to XML data. Its topics of interest include
query languages such as SQL and referential data integrity problems. The IR community, on the other hand,
applies standard IR techniques, with some variations, to focused retrieval at the element level. Despite
some similarities with unstructured text, XML requires special attention, in particular in determining the
relationship between the elements and the user's query and in the methods for evaluating it.